Abstract

The papers [2, 3] establish the connection between importance sampling
algorithms for estimating rare-event probabilities, two-person
zero-sum differential games, and the associated Isaacs equation. In order
to construct nearly optimal schemes in a general setting, one must
consider dynamic schemes, i.e., changes of measure that, in the course
of a single simulation, can depend on the outcome of the simulation
up till that time. The present paper and a companion paper [4] show
that classical sense subsolutions of the Isaacs equation provide a basic
and flexible tool for the construction and analysis of nearly optimal
schemes. Asymptotic analysis is the topic of the present paper, while
[4] focuses on explicit methods for the construction of subsolutions,
implementation aspects and numerical results.

1  Introduction


In  a  pair  of  recent  papers  [2,  3],  we  discuss  how  one  can  characterize  the 
optimal  achievable  performance  of  importance  sampling  schemes  in  the  large 
deviation  limit  in  terms  of  a  deterministic  differential  game.  The  value 
function  of  the  game  can,  in  turn,  be  characterized  as  the  solution  to  a 
certain  nonlinear  partial  differential  equation  (PDE)  known  as  an  Isaacs 
equation.  Asymptotically  optimal  importance  sampling  schemes  are  then 
constructed  based  on  this  solution. 

The purpose of the present paper and a companion paper is to explore
this connection in further depth. More precisely, we show how one can
construct importance sampling schemes based on subsolutions of the Isaacs
equation. Since a solution is always a subsolution, this leads to a more
general class of schemes. The main result of the paper concerns the
asymptotic performance of importance sampling schemes that are based on
a given subsolution. The performance is in fact characterized by the value
of the subsolution at a particular point. The proof is carried out in a general
setting that contains as special cases sums of independent identically
distributed (iid) random variables and the empirical measure of a finite-state
Markov chain. However, its potential application is much broader, and
includes systems with state dependencies and small noise effects, solutions to
stochastic differential equations, systems with constrained dynamics (e.g.,
queuing networks), and different forms of the expected value (e.g.,
probabilities of path dependent events). Some of these developments will be
reported elsewhere.

One is often interested in properties other than just asymptotic optimality
(e.g., ease of construction, ease of implementation). It turns out that
one can often construct subsolutions that have a much simpler structure
than the actual solution, and which induce schemes that are asymptotically
optimal. This is important since the simplicity of a subsolution is usually
reflected in the scheme it generates. It is therefore important to develop
flexible techniques for the construction of subsolutions. That is the topic of
the companion paper [4], which also presents some numerical results for the
broader class of applications mentioned in the previous paragraph.

The paper is organized as follows. Since the underlying game and Isaacs
equation are not yet widely exposed in the importance sampling literature,
we give some heuristics and a formal overview in Section 2 in the
setting of sums of iid random variables. In particular, we formally derive
the Isaacs equation, and indicate why subsolutions to this equation both
suggest schemes and serve as a basic tool in their analysis. In Section 3
the general model and assumptions are stated. Importance sampling for
Markov chains uses a collection of eigenfunctions that are related to the
transition kernel of the chain, and Section 4 reviews the properties of these
eigenfunctions. Section 5 identifies the Isaacs equation appropriate for the
class of importance sampling problems introduced in Section 3, and Section
6 constructs the importance sampling schemes that are associated with
particular subsolutions. The main result of the paper analyzes the asymptotic
variance of a scheme associated with a given subsolution, and characterizes
the performance of the scheme in terms of the value of the subsolution at
a particular point. This result is stated and proved in Section 7. Finally, a
tightness result needed for the asymptotic analysis is proved in the appendix.

Notation. For a Polish space $S$, $\mathcal{P}(S)$ denotes the collection of all
probability measures on $(S, \mathcal{B}(S))$, where $\mathcal{B}(S)$ is the Borel $\sigma$-algebra. There will
be many instances in this paper where we decompose measures on a product
space as the product of a marginal distribution and a stochastic kernel. The
following notation will be used. Suppose that $\mu \in \mathcal{P}(S_1 \times S_2)$ (with each $S_i$
a Polish space) is such a probability measure. Then $[\mu]_1$ will denote the first
marginal of $\mu$, and $\mu(dy_2|y_1)$ will denote the stochastic kernel on $S_2$ given
$S_1$ such that $\mu(dy_1 \times dy_2) = [\mu]_1(dy_1)\mu(dy_2|y_1)$. Quantities such as $[\mu]_2$,
$\mu(dy_1|y_2)$, and the extension to products of more than two Polish spaces
are all defined in the analogous fashion. Given $\mu \in \mathcal{P}(S_1)$ and a stochastic
kernel $q$ on $S_2$ given $S_1$, we let $\mu \otimes q$ denote $\mu(dy_1)q(dy_2|y_1) \in \mathcal{P}(S_1 \times S_2)$.


2  An  Introduction  to  the  Role  of  Subsolutions 

This  section  describes  how  an  Isaacs  equation  arises  in  importance  sampling, 
how  subsolutions  to  that  equation  can  be  constructed,  how  they  induce 
importance  sampling  schemes,  and  the  implications  for  performance  of  the 
schemes.  Since  it  is  an  overview,  we  will  not  give  all  details  and  will  not 
be  precise  regarding  all  necessary  assumptions.  The  overview  is  provided  in 
the  simplest  possible  setting:  sums  of  iid  random  variables.  The  rest  of  the 
paper  and  the  companion  paper  will  consider  more  elaborate  models. 


2.1  Problem  formulation  for  sums  of  iid  random  variables 


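Consider

$$X_n = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$

where the $\{Y_i, i \in \mathbb{N}\}$ are iid with distribution $\mu$. For $\alpha \in \mathbb{R}^d$ let

$$H(\alpha) = \log \int_{\mathbb{R}^d} e^{\langle \alpha, y\rangle}\mu(dy),$$

where we assume $\int_{\mathbb{R}^d} \exp\langle \alpha, y\rangle\, \mu(dy) < \infty$ for each such $\alpha$. Consider also
the Legendre transform

$$L(\beta) = \sup_{\alpha \in \mathbb{R}^d} \left[\langle \alpha, \beta\rangle - H(\alpha)\right].$$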
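As a concrete illustration of the pair $(H, L)$, the following is a minimal sketch that assumes $\mu$ is the standard normal distribution on $\mathbb{R}$, so that $H(\alpha) = \alpha^2/2$ and $L(\beta) = \beta^2/2$ in closed form. The Gaussian choice, the helper names `log_mgf` and `legendre`, and the grid search are assumptions made only for this example.

```python
def log_mgf(alpha):
    """H(alpha) for the standard normal distribution (assumed example)."""
    return 0.5 * alpha * alpha


def legendre(beta, lo=-10.0, hi=10.0, steps=20001):
    """Approximate L(beta) = sup_alpha [alpha * beta - H(alpha)] on a grid."""
    best = float("-inf")
    for k in range(steps):
        alpha = lo + (hi - lo) * k / (steps - 1)
        best = max(best, alpha * beta - log_mgf(alpha))
    return best
```

For $\beta$ well inside the grid range, `legendre(beta)` should agree with $\beta^2/2$ up to the grid resolution.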

Importance  sampling  is  a  Monte  Carlo  method  for  the  estimation  of 
expected  values.  One  samples  from  a  distribution  that  may  differ  from  the 
true  distribution,  and  in  order  to  guarantee  that  the  resulting  estimate  is 
unbiased  one  multiplies  each  sample  by  the  appropriate  Radon-Nikodym 
derivative.  The  goal  is  then  to  choose  the  sampling  distribution  so  that  this 
estimate  has  low  variance.  Suppose  the  functional  of  interest  is 

$$E \exp\{-nF(X_n)\}.$$

In the context of sums of iid random variables, one typically uses the
following parametric family of exponential changes of measure to generate the
replacements for the $Y_i$:

$$\mu_\alpha(dy) = e^{\langle \alpha, y\rangle - H(\alpha)}\mu(dy).$$

In constructing the replacement for $Y_i$ we use a dynamic change of measure.
For a function $\bar\alpha(x,t) : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$ recursively define the following
quantities. Let $\bar X^n_0 = 0$, and assume that $\bar X^n_j, \bar Y^n_j$, $j = 1, \ldots, i$, have been
defined. Let $\bar Y^n_{i+1}$, conditioned on $\bar X^n_j, \bar Y^n_j$, $j = 1, \ldots, i$, have distribution
$\mu_{\bar\alpha(\bar X^n_i, i/n)}$; then set $\bar X^n_{i+1} = \bar X^n_i + \bar Y^n_{i+1}/n$. When $\bar X^n_i, \bar Y^n_i$ have been
defined for all $i = 1, \ldots, n$, set

$$Z^n = e^{-nF(\bar X^n_n)} \prod_{i=0}^{n-1} e^{H(\bar\alpha(\bar X^n_i, i/n)) - \langle \bar\alpha(\bar X^n_i, i/n), \bar Y^n_{i+1}\rangle}.$$

It is easy to check that $EZ^n = Ee^{-nF(X_n)}$, and so the average of $K$ independent
samples of $Z^n$ converges almost surely to $Ee^{-nF(X_n)}$ as $K \to \infty$. Since
the estimator is unbiased, to minimize the variance one can minimize the
second moment, and to do this it is enough to minimize the second moment
of the single sample $Z^n$.
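The recursive construction of $Z^n$ can be sketched as follows. This is a minimal illustration, assuming scalar standard normal increments, so that the tilted law $\mu_\alpha$ is $N(\alpha, 1)$ and $H(\alpha) = \alpha^2/2$. The function names `sample_z` and `estimate`, and the callables `F` and `control`, are hypothetical and chosen for the example.

```python
import math
import random


def log_mgf(alpha):
    """H(alpha) for standard normal increments (assumed example)."""
    return 0.5 * alpha * alpha


def sample_z(n, F, control, rng):
    """Draw one sample of Z^n under the state-dependent tilt control(x, t)."""
    x = 0.0        # running value of the controlled mean, started at 0
    log_lr = 0.0   # log of the accumulated likelihood ratio
    for i in range(n):
        a = control(x, i / n)
        y = rng.gauss(a, 1.0)          # draw from the tilted law mu_a = N(a, 1)
        log_lr += log_mgf(a) - a * y   # factor e^{H(a) - <a, y>}
        x += y / n
    return math.exp(-n * F(x) + log_lr)


def estimate(n, F, control, num_samples, seed=0):
    """Average of independent copies of Z^n."""
    rng = random.Random(seed)
    total = sum(sample_z(n, F, control, rng) for _ in range(num_samples))
    return total / num_samples
```

A simple sanity check, used here only for testing, is the linear choice $F(x) = cx$ with the constant control $\bar\alpha \equiv -c$: every sample then equals $e^{nc^2/2}$, which is the exact value of $Ee^{-nF(X_n)}$ in this Gaussian case.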

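We consider the problem of minimizing the second moment as a control
problem, with $\bar\alpha$ the control. It is here that the problem connects naturally
with a PDE. To make the connection we must extend the problem slightly.
For $i \in \mathbb{N} \cup \{0\}$ and $x \in \mathbb{R}^d$ define $\bar X^n_j$, $j = i, \ldots, n-1$ as above save $\bar X^n_i = x$,
and then define

$$V^n(x,i) = \inf_{\bar\alpha} E\left[\left(e^{-nF(\bar X^n_n)} \prod_{j=i}^{n-1} e^{H(\bar\alpha(\bar X^n_j, j/n)) - \langle \bar\alpha(\bar X^n_j, j/n), \bar Y^n_{j+1}\rangle}\right)^2\right].$$

It will be more convenient to express this in terms of the original random
variables:

$$V^n(x,i) = \inf_{\bar\alpha} E\left[e^{-2nF(X^n_n)} \prod_{j=i}^{n-1} e^{H(\bar\alpha(X^n_j, j/n)) - \langle \bar\alpha(X^n_j, j/n), Y_{j+1}\rangle}\right],$$

where $X^n_i = x$ and $X^n_{j+1} = X^n_j + Y_{j+1}/n$.
Owing to the exponential scaling in $n$, one gets a simple asymptotic problem
by considering the logarithmic transform

$$W^n(x,i) = -\frac{1}{n}\log V^n(x,i).$$

The performance of the scheme corresponding to $\bar\alpha$ can then be characterized
in terms of $\liminf_{n\to\infty} W^n(0,0)$, with larger values indicating better
performance.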

2.2  The  associated  Isaacs  equation 

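$V^n$ is the value function of a discrete time stochastic control problem, and
as such, satisfies the dynamic programming equation

$$V^n(x,i) = \inf_{\alpha \in \mathbb{R}^d} \int_{\mathbb{R}^d} e^{H(\alpha) - \langle \alpha, y\rangle}\, V^n\!\left(x + \frac{y}{n}, i+1\right)\mu(dy).$$

A variational formula involving relative entropy (see [1, Section 1.4] and
below) shows how to represent exponential integrals in terms of relative
entropy. For $\gamma \in \mathcal{P}(\mathbb{R}^d)$ with $\gamma \ll \mu$ and $\log(d\gamma/d\mu)$ integrable with respect
to $\gamma$ set

$$R(\gamma\|\mu) = \int_{\mathbb{R}^d} \log\frac{d\gamma}{d\mu}\, d\gamma,$$

and otherwise set $R(\gamma\|\mu) = \infty$. Then for any bounded and continuous
function $f : \mathbb{R}^d \to \mathbb{R}$,

$$-\log \int_{\mathbb{R}^d} e^{-f(y)}\mu(dy) = \inf_{\gamma \in \mathcal{P}(\mathbb{R}^d)} \left[R(\gamma\|\mu) + \int_{\mathbb{R}^d} f(y)\gamma(dy)\right].$$

Applying this to the dynamic programming equation and using the definition
of $W^n$ gives the following discrete time Isaacs equation:

$$W^n(x,i) = \sup_{\alpha \in \mathbb{R}^d} \inf_{\gamma \in \mathcal{P}(\mathbb{R}^d)} \left[\int_{\mathbb{R}^d} W^n\!\left(x + \frac{y}{n}, i+1\right)\gamma(dy) + \frac{1}{n}\left(R(\gamma\|\mu) + \int_{\mathbb{R}^d} \langle \alpha, y\rangle \gamma(dy) - H(\alpha)\right)\right].$$

To formally relate $W^n(x,i)$ to the solution of a PDE, suppose that for a
smooth function $W : \mathbb{R}^d \times [0,1] \to \mathbb{R}$,

$$W^n(x,i) \approx W(x, i/n).$$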

We also use the following relationship between relative entropy and the
function $L$ defined as the Legendre transform of $H$ (see [1, Section C.5]).
For any $\beta \in \mathbb{R}^d$,

$$\inf\left\{R(\gamma\|\mu) : \int_{\mathbb{R}^d} y\,\gamma(dy) = \beta\right\} = L(\beta). \tag{2.1}$$

(It in fact turns out that the infimizing $\gamma$ is of the form $\mu_\alpha$ for the point $\alpha$
that is conjugate to $\beta$ in the sense of convex duality. It is for this reason
that the class of "exponential tilts" is asymptotically optimal.) We then
bring $W^n(x,i) \approx W(x, i/n)$ to the right side of the Isaacs equation, expand
via Taylor series, insert the relation above and then multiply by $n$ and send
$n \to \infty$ to get

$$W_t(x,t) + \sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(DW(x,t); \alpha, \beta) = 0.$$

Here $W_t$ denotes the partial derivative with respect to $t$, $DW$ the gradient
in $x$, and

$$\mathbb{H}(s; \alpha, \beta) = \langle s, \beta\rangle + L(\beta) + \langle \alpha, \beta\rangle - H(\alpha) \tag{2.2}$$

for $s, \alpha, \beta \in \mathbb{R}^d$. Note also that one expects the terminal condition $W(x,1) =
2F(x)$ to hold.

This PDE, which is also known as an Isaacs equation, was identified in
[2] and used there to study the performance of certain importance sampling
schemes. However, the purpose of the present paper is to show that it is only
the subsolution property that is essential. By a classical sense subsolution,
we mean a function $\bar W : \mathbb{R}^d \times [0,1] \to \mathbb{R}$ with a smooth extension to an open
neighborhood of $\mathbb{R}^d \times [0,1]$ such that

$$\bar W_t(x,t) + \sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(D\bar W(x,t); \alpha, \beta) \ge 0$$

for all $(x,t)$ and $\bar W(x,1) \le 2F(x)$. We also consider a priori fixed change of
measure controls $\bar\alpha(x,t)$, and call $(\bar W, \bar\alpha)$ a subsolution/control pair if

$$\bar W_t(x,t) + \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(D\bar W(x,t); \bar\alpha(x,t), \beta) \ge 0$$

for all $(x,t)$ and $\bar W(x,1) \le 2F(x)$. The definition of a subsolution simply
replaces the equality that appears in the Isaacs equation and terminal
condition with inequalities. However, we are only interested in bounding the
quantity $W^n(0,0)$ from below, with an upper bound being automatic from
the fact that the best possible performance is bounded. The inequalities in
the definition are those which give lower bounds when the smooth subsolution
is combined with a verification argument to estimate the performance.

Remark 2.1 The supremum and infimum in the Isaacs equation can be
evaluated to give

$$W_t - 2H(-DW/2) = 0.$$

This equation immediately suggests the form of certain simple but important
solutions to the Isaacs equation; see the next section and the discussion
in Section 3.1 of [4]. However, the analysis of a specific proposed importance
sampling scheme requires the equation and definition given above for
a subsolution/control pair.
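The identity in Remark 2.1 can be checked numerically in a simple case. The sketch below assumes the scalar Gaussian case $H(\alpha) = \alpha^2/2$, $L(\beta) = \beta^2/2$, where the inner infimum over $\beta$ has the closed form $-(s+\alpha)^2/2 - \alpha^2/2$. The function name `sup_inf_hamiltonian` and the grid search over $\alpha$ are assumptions of this example.

```python
def sup_inf_hamiltonian(s, lo=-5.0, hi=5.0, steps=10001):
    """Grid approximation of sup_alpha inf_beta HH(s; alpha, beta), Gaussian case."""
    best = float("-inf")
    for k in range(steps):
        alpha = lo + (hi - lo) * k / (steps - 1)
        # inf over beta of (s + alpha) * beta + beta^2 / 2 equals -(s + alpha)^2 / 2;
        # the remaining term -H(alpha) = -alpha^2 / 2 does not depend on beta.
        inner = -0.5 * (s + alpha) ** 2 - 0.5 * alpha ** 2
        best = max(best, inner)
    return best
```

For moderate $s$ the returned value should agree with $-2H(-s/2) = -s^2/4$ up to the grid resolution.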

In the remainder of this motivational section we give several simple
examples of subsolutions and discuss how the Isaacs equation gives bounds on
the second moment of the associated schemes.

2.3  Two  simple  examples 

Example 1. Let $F$ be convex and bounded from below. Interchanging the
supremum and infimum in the Isaacs equation and evaluating the supremum
on $\alpha$ gives

$$W_t(x,t) + \inf_{\beta \in \mathbb{R}^d} \left[\langle DW(x,t), \beta\rangle + 2L(\beta)\right] = 0.$$

The viscosity solution to this PDE and terminal condition is well known,
and indeed

$$W(x,t) = \inf_{\beta \in \mathbb{R}^d} \left[2(1-t)L(\beta) + 2F(x + (1-t)\beta)\right].$$

(Strictly speaking, this solution need not be smooth. We will not concern
ourselves with such issues in this overview, but note that all the subsolutions
we work with will be classical sense, smooth subsolutions.)
Although it is easy in this example to construct the exact solution, one
may wish to obtain a subsolution that will generate simpler importance
sampling schemes with the same asymptotic performance. As we will see, the
property that is needed so that the asymptotic performance of a subsolution
$\bar W$ is optimal is $\bar W(0,0) = W(0,0)$. Let $\beta^*$ achieve the infimum in the
definition of $W(0,0)$, and let $\alpha^*$ satisfy

$$L(\beta^*) = \langle \alpha^*, \beta^*\rangle - H(\alpha^*)$$

(i.e., $\alpha^*$ is conjugate to $\beta^*$). Let

$$\bar W(x,t) = -2\langle \alpha^*, x\rangle + 2tH(\alpha^*) + 2\left[L(\beta^*) + F(\beta^*)\right].$$

Then $\bar W_t(x,t) = 2H(\alpha^*)$ and $D\bar W(x,t) = -2\alpha^*$. Since

$$\begin{aligned}
\bar W_t(x,t) + \sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(D\bar W(x,t); \alpha, \beta)
&= 2H(\alpha^*) + \inf_{\beta \in \mathbb{R}^d} \sup_{\alpha \in \mathbb{R}^d} \left[\langle -2\alpha^*, \beta\rangle + L(\beta) + \langle \alpha, \beta\rangle - H(\alpha)\right] \\
&= 2H(\alpha^*) + \inf_{\beta \in \mathbb{R}^d} \left[\langle -2\alpha^*, \beta\rangle + 2L(\beta)\right] \\
&= 2H(\alpha^*) - 2\sup_{\beta \in \mathbb{R}^d} \left[\langle \alpha^*, \beta\rangle - L(\beta)\right] \\
&= 2H(\alpha^*) - 2H(\alpha^*) \\
&= 0,
\end{aligned}$$

we have only to check the terminal condition. To simplify we assume $L$
is differentiable at $\beta^*$, a very mild condition. Then one can verify that
$\bar W(x,1)$ is a supporting hyperplane to $2F$ at $\beta^*$, and so $\bar W(x,1) \le 2F(x)$.
Note also that $\bar W(0,0) = 2\left[L(\beta^*) + F(\beta^*)\right] = W(0,0)$. Thus $\bar W$ achieves
the maximum possible value among all subsolutions at $(0,0)$, with a much
simpler structure than the true solution.
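The affine subsolution of Example 1 can be written out explicitly in a small case. The following sketch assumes the scalar Gaussian case $H(\alpha) = \alpha^2/2$, $L(\beta) = \beta^2/2$ and the convex function $F(x) = (x-1)^2$; both choices, and the names `W_bar`, `beta_star`, `alpha_star`, are assumptions made only for illustration. Here $\beta^* = 2/3$ minimizes $L + F$ and $\alpha^* = L'(\beta^*) = 2/3$.

```python
def H(alpha):
    """Log moment generating function of N(0, 1) (assumed example)."""
    return 0.5 * alpha ** 2


def L(beta):
    """Legendre transform of H in the Gaussian case."""
    return 0.5 * beta ** 2


def F(x):
    """A convex function bounded from below (assumed example)."""
    return (x - 1.0) ** 2


beta_star = 2.0 / 3.0    # solves beta + 2 * (beta - 1) = 0, the minimizer of L + F
alpha_star = beta_star   # conjugate point: alpha* = L'(beta*) = beta*


def W_bar(x, t):
    """Affine subsolution -2<a*, x> + 2 t H(a*) + 2 [L(b*) + F(b*)]."""
    return (-2.0 * alpha_star * x + 2.0 * t * H(alpha_star)
            + 2.0 * (L(beta_star) + F(beta_star)))


# Terminal condition W_bar(x, 1) <= 2 F(x), checked on a grid of x values.
grid = [k / 100.0 for k in range(-500, 501)]
terminal_gap = min(2.0 * F(x) - W_bar(x, 1.0) for x in grid)
```

In this example the gap $2F(x) - \bar W(x,1)$ equals $2(x - 2/3)^2$, so the hyperplane touches $2F$ at $x = \beta^*$, and $\bar W(0,0) = 2/3$.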

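Evaluating the infimum on $\beta$ first, we find that in the case of the exact
solution

$$\begin{aligned}
\sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(DW(x,t); \alpha, \beta)
&= \sup_{\alpha \in \mathbb{R}^d} \left(-\sup_{\beta \in \mathbb{R}^d} \left[-L(\beta) + \langle -DW(x,t) - \alpha, \beta\rangle + H(\alpha)\right]\right) \\
&= -\inf_{\alpha \in \mathbb{R}^d} \left[H(-DW(x,t) - \alpha) + H(\alpha)\right].
\end{aligned}$$

By convexity the supremum on $\alpha$ is achieved at $\bar\alpha(x,t) = -DW(x,t)/2$.
This is the importance sampling control that is naturally suggested by the
exact solution. For the affine subsolution the supremum is at $\bar\alpha(x,t) =
-D\bar W(x,t)/2 = \alpha^*$. The corresponding very simple importance sampling
scheme is well known in the literature.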

Example 2. Here we take $F(x) = F_1(x) \wedge F_2(x)$, where each $F_i$ is convex
and bounded from below. Although the exact solution takes the same form
as in Example 1, there may not be an affine subsolution. However, it is
natural to consider the minimum of the affine subsolutions associated with
each of the $F_i$. Thus let

$$\left[L(\beta_i^*) + F_i(\beta_i^*)\right] = \inf_{\beta \in \mathbb{R}^d} \left[L(\beta) + F_i(\beta)\right],$$

let $\alpha_i^*$ be the convex conjugate of $\beta_i^*$, and set

$$\bar W_i(x,t) = -2\langle \alpha_i^*, x\rangle + 2tH(\alpha_i^*) + 2\left[L(\beta_i^*) + F_i(\beta_i^*)\right].$$

Let $\bar W(x,t) = \bar W_1(x,t) \wedge \bar W_2(x,t)$. Ignoring the issue of what is meant at
points where $\bar W$ is not differentiable, this provides a subsolution. One can
check that

$$\bar W(0,0) = 2\bigwedge_{i=1}^{2} \left[L(\beta_i^*) + F_i(\beta_i^*)\right] = W(0,0).$$

To produce a smooth subsolution, it turns out that one can simply mollify
$\bar W(x,t)$ [4]. Since $\bar W$ is identified as the pointwise minimum of smooth
(affine) functions, we use the standard approximation which we will call
exponential weighting. Let $\delta$ be a small positive number, and

$$\bar W^\delta(x,t) = -\delta \log\left(e^{-\bar W_1(x,t)/\delta} + e^{-\bar W_2(x,t)/\delta}\right).$$

Define the probability vector $(\rho_1(x,t), \rho_2(x,t))$ by

$$\rho_i(x,t) = \frac{e^{-\bar W_i(x,t)/\delta}}{e^{-\bar W_1(x,t)/\delta} + e^{-\bar W_2(x,t)/\delta}}.$$

It follows that

$$\bar W^\delta_t(x,t) = \rho_1(x,t)\,2H(\alpha_1^*) + \rho_2(x,t)\,2H(\alpha_2^*), \tag{2.3}$$

$$D\bar W^\delta(x,t) = -\rho_1(x,t)\,2\alpha_1^* - \rho_2(x,t)\,2\alpha_2^*. \tag{2.4}$$

One can also easily verify that

$$\bar W(x,t) \ge \bar W^\delta(x,t) \ge \bar W(x,t) - \delta \log 2.$$

Hence the value $\bar W^\delta(0,0)$ may be slightly smaller than $\bar W(0,0)$, but this
difference can be made arbitrarily small.
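The exponential weighting and its weights are straightforward to compute. The sketch below is a minimal illustration for an arbitrary finite list of values $\bar W_i(x,t)$; the function names `soft_min` and `weights` are hypothetical, and subtracting the minimum before exponentiating is a numerical-stability choice of this example, not something taken from the text.

```python
import math


def soft_min(values, delta):
    """Return -delta * log(sum_i exp(-v_i / delta)), computed stably."""
    m = min(values)
    total = sum(math.exp(-(v - m) / delta) for v in values)
    return m - delta * math.log(total)


def weights(values, delta):
    """Probability vector with rho_i proportional to exp(-v_i / delta)."""
    m = min(values)
    w = [math.exp(-(v - m) / delta) for v in values]
    total = sum(w)
    return [wi / total for wi in w]
```

With two values the result of `soft_min` lies between $\min_i v_i - \delta\log 2$ and $\min_i v_i$, matching the bounds stated above, and the weights concentrate on the smaller value as $\delta \to 0$.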

The optimizing $\bar\alpha$ can be found as in Example 1:

$$\bar\alpha(x,t) = \rho_1(x,t)\alpha_1^* + \rho_2(x,t)\alpha_2^*.$$

There are (at least) two ways that this control can be implemented. The
first is the one given at the beginning of this overview, i.e., $\mu_{\bar\alpha(\bar X^n_i, i/n)}$ chooses
the distribution of $\bar Y^n_{i+1}$. The second implementation leads to the notion of
generalized subsolution/control, which will be discussed further in Section
5. With this implementation, one chooses between the indices 1 and 2,
conditioned on all the past data, with weights $\rho_1(\bar X^n_i, i/n)$ and $\rho_2(\bar X^n_i, i/n)$,
and then depending on the outcome generates $\bar Y^n_{i+1}$ according to $\mu_{\alpha_1^*}$ or
$\mu_{\alpha_2^*}$. The latter implementation has some advantages when the underlying
process is more complicated than an iid sequence (e.g., a functional of a
Markov chain). Of course with this implementation the Radon-Nikodym
derivative takes a different form than the one given at the beginning of this
section. See Section 6.
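The second, randomized implementation can be sketched in the scalar Gaussian case, where $\mu_\alpha = N(\alpha, 1)$ and $H(\alpha) = \alpha^2/2$. The weight used below is the likelihood ratio of $\mu$ with respect to the mixture $\sum_k \rho_k \mu_{\alpha_k^*}$; this is one natural choice and is an assumption of this sketch rather than a quotation of the construction in Section 6. The function name `sample_z_randomized` and the callable `rho` are hypothetical.

```python
import math
import random


def sample_z_randomized(n, F, alphas, rho, rng):
    """One sample using a randomized choice among constant tilts (Gaussian case)."""
    x, log_lr = 0.0, 0.0
    for i in range(n):
        p = rho(x, i / n)                                  # weights rho_k(x, t)
        k = rng.choices(range(len(alphas)), weights=p)[0]  # pick an index
        y = rng.gauss(alphas[k], 1.0)                      # draw from mu_{alpha_k}
        # density at y of the mixture sum_k p_k mu_{alpha_k} relative to mu
        mix = sum(pk * math.exp(a * y - 0.5 * a * a) for pk, a in zip(p, alphas))
        log_lr -= math.log(mix)
        x += y / n
    return math.exp(-n * F(x) + log_lr)
```

When all the tilts coincide, the mixture reduces to a single exponential tilt, which gives a simple consistency check against the first implementation.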

Many more examples and a more systematic approach to the construction
of subsolutions appear in [4].

Remark 2.2 There are other methods of mollification to produce smooth
subsolutions. For example, one can integrate $\bar W$ against a smooth convolution
kernel with support in the ball of radius $\delta$ around 0. It turns out that
the resulting approximation (abusing notation) $\bar W^\delta$ also satisfies equations
(2.3) and (2.4) for some probability vector $(\rho_1(x,t), \rho_2(x,t))$. However, the
computation of $\rho_i(x,t)$ involves numerical integration and can be computationally
demanding. In contrast, for the exponential weighting mollification
we use, the $\{\rho_i(x,t)\}$ are easy to compute.

2.4  Performance  of  the  schemes 

Finally we remark on the performance of the importance sampling schemes
so constructed. The main result of this paper, which will be proved for a
more complex process model and which can be generalized considerably, is
the following: If $Z^n$ is constructed according to the subsolution/control pair
$(\bar W, \bar\alpha)$ and

$$W^n = -\frac{1}{n}\log E(Z^n)^2,$$

then

$$\liminf_{n \to \infty} W^n \ge \bar W(0,0). \tag{2.5}$$

Moreover, the same result is true for generalized subsolutions and controls.

An elementary calculation based on Jensen's inequality shows that for
any importance sampling scheme the best performance possible is $W(0,0)$,
where $W$ is the exact solution. Hence the design problem becomes clear:
construct a subsolution/control pair which can be implemented with reasonable
effort and for which $\bar W(0,0)$ is acceptably close to $W(0,0)$.

The remainder of this paper is devoted to the precise statement and
proof of (2.5) in a more general setting.

3  The  General  Setup 

The broader collection of importance sampling problems we wish to analyze
includes sums of independent and identically distributed (iid) random variables
and sums of functionals of a finite state Markov chain. The following
general model includes both as special cases. Let $Y = \{Y_i, i \in \mathbb{N}_0\}$ denote
a Markov chain with state space $S$. Assume that $S$ is a Polish space, and
let $p(y, dz)$ denote the probability transition kernel. Let $\{b_i(\cdot), i \in \mathbb{N}_0\}$ be a
sequence of iid random vector fields on $S$ that is independent of the Markov
chain $Y$. For each $y \in S$, $b_i(y)$ is distributed according to a probability
measure, say $m(\cdot|y)$, on $\mathbb{R}^d$. Our interest is in sums of the form

$$X_n = \frac{1}{n}\sum_{i=1}^{n} b_i(Y_i). \tag{3.1}$$

By choosing $S$ to be a single point we recover the case of sums of iid random
variables, whereas taking $b_i(y)$ to be deterministic [i.e., $m(\cdot|y)$ is a single
atom for each $y \in S$] produces the case of functionals of a Markov chain.
The general case is also of interest, and occurs when the distribution of the
summand $b_i$ is modulated by the "exogenous" process $Y$.
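The sum (3.1) can be simulated directly once a way to sample from $p(y, \cdot)$ and $m(\cdot|y)$ is available. The following is a minimal sketch; the function name `sample_mean` and the callables `step` and `draw_b`, which stand in for samplers of $p$ and $m$, are hypothetical placeholders supplied by the user.

```python
import random


def sample_mean(n, y0, step, draw_b, rng):
    """Return X_n = (1/n) * sum_{i=1}^n b_i(Y_i) for a chain started at Y_0 = y0."""
    y, total = y0, 0.0
    for _ in range(n):
        y = step(y, rng)          # Y_i drawn from p(Y_{i-1}, .)
        total += draw_b(y, rng)   # b_i(Y_i) drawn from m(. | Y_i)
    return total / n
```

A `step` that ignores its state argument and a `draw_b` that samples from a fixed distribution recover the iid case, while a deterministic `draw_b` gives a functional of the Markov chain.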

Remark 3.1 In the literature on importance sampling for Markov chains
it is standard to include the initial state $Y_0 = y$ in the sample mean. The
sole reason to consider the sum from $i = 1$ to $n$, as in the definition (3.1)
of $X_n$, is that it significantly simplifies our notation in later analysis. We
point out, however, that there is no loss of generality, in that all the results
in this paper hold if we replace definition (3.1) by the standard one where
the summation is taken from $i = 0$ to $i = n - 1$.

Condition 3.1 The following conditions are assumed throughout the paper.

1. There is a reference probability measure $\lambda$ on $S$, a positive integer $m_0$,
and $\delta \in (0,1)$, such that

$$\delta\lambda(dy_2) \le p^{(m_0)}(y_1, dy_2) \le \frac{1}{\delta}\lambda(dy_2)$$

for all $y_1 \in S$. Here $p^{(m_0)}$ is the $m_0$-step transition kernel corresponding
to $p$.

2. The transition kernel $p(y_1, dy_2)$ satisfies the Feller property, i.e., the
mapping $y_1 \mapsto p(y_1, dy_2)$ is continuous in the topology of weak convergence
of probability measures on $S$.

3. The mapping $y \mapsto m(dz|y)$ is continuous in the topology of weak convergence
of probability measures on $\mathbb{R}^d$.

4. For each $\alpha \in \mathbb{R}^d$,

$$\sup_{y \in S} \int_{\mathbb{R}^d} e^{\langle \alpha, z\rangle} m(dz|y) < \infty.$$

Note that parts 1, 2, and 3 of Condition 3.1 automatically hold when $Y$ is
an irreducible finite state Markov chain.

For a pair of probability measures $\gamma, \mu \in \mathcal{P}(S)$, we recall that the relative
entropy of $\gamma$ with respect to $\mu$ was defined as

$$R(\gamma\|\mu) = \int \log\frac{d\gamma}{d\mu}\, d\gamma$$

if $\gamma \ll \mu$ and $R(\gamma\|\mu) = \infty$ otherwise. The relative entropy $R(\gamma\|\mu)$ is always
non-negative, and is a convex, lower semicontinuous function of $(\gamma, \mu) \in
\mathcal{P}(S) \times \mathcal{P}(S)$. We refer the reader to [1, Section 1.4] for the proof and other
properties of relative entropy.

Under Condition 3.1, $\{X_n, n \in \mathbb{N}\}$ satisfies a large deviation principle
with the rate function

$$L(\beta) = \inf\left\{R(\tau\,\|\,\theta \otimes p) + R(\theta \otimes \nu\,\|\,\theta \otimes m) : [\tau]_1 = [\tau]_2 = \theta,\ \int_S \int_{\mathbb{R}^d} z\,\nu(dz|y)\,\theta(dy) = \beta\right\}. \tag{3.2}$$

Here $\tau$ is a probability measure on $S \times S$ and $\nu$ is a stochastic kernel on
$\mathbb{R}^d$ given $S$. The fact that a large deviation principle holds is proved in [7],
although they do not identify the rate function in this form but rather in
terms of a Legendre transform. One can give a direct proof of the large
deviation result as in [1] which automatically gives this more concrete form
of the rate function (3.2), which is analogous to (2.1). See, in particular, the
analogous prelimit representation formula in [1, Section 4.4].

For a Borel measurable function $F : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, we wish to numerically
approximate the quantity

$$E_y \exp\{-nF(X_n)\}, \tag{3.3}$$

where $E_y$ denotes expected value given initial state $Y_0 = y$. The special case
of $P_y\{X_n \in A\}$ is obtained by letting $F(x) = 0$ for $x \in A$ and $F(x) = \infty$ for
$x \notin A$. Under various sets of regularity conditions on $F$, one has the large
deviation asymptotic approximation [8, 1]

$$-\frac{1}{n}\log E_y \exp\{-nF(X_n)\} \to \inf_{\beta \in \mathbb{R}^d} \left[F(\beta) + L(\beta)\right]. \tag{3.4}$$

4  Properties  of  the  Relevant  Eigenfunctions 

It is well known that certain eigenfunctions are needed to construct good
importance sampling schemes for functionals of a Markov chain. These
eigenfunctions are used to essentially "cancel off" the effect of conditioning
on the transition kernel. The eigenvalue/eigenfunction problem is to find,
for each $\alpha \in \mathbb{R}^d$, a real number $G(\alpha)$ and a function $r(\cdot; \alpha) : S \to [0, \infty)$
such that

$$\int_S \int_{\mathbb{R}^d} e^{\langle \alpha, z\rangle} r(y; \alpha)\, m(dz|y)\, p(x, dy) = e^{G(\alpha)} r(x; \alpha).$$

A key fact is that the eigenvalues may be defined in terms of the Legendre
transform of $L$. This is defined for $\alpha \in \mathbb{R}^d$ by

$$H(\alpha) = \sup_{\beta \in \mathbb{R}^d} \left[\langle \alpha, \beta\rangle - L(\beta)\right],$$

and is again a convex function.

The needed properties of the solution to this problem are summarized
in the following lemma [7, Section 3].

Lemma 4.1 Assume Condition 3.1. The following conclusions hold.

1. For each $\alpha \in \mathbb{R}^d$ there exists a solution $(G(\alpha), r(\cdot; \alpha))$ to the
eigenvalue/eigenfunction problem, with $G(\alpha) = H(\alpha)$.

2. Let a compact set $K \subset \mathbb{R}^d$ be given. Then there is $\delta \in (0,1)$ such that
$\delta \le r(y; \alpha) \le 1/\delta$ for all $y \in S$ and $\alpha \in K$.

3. Let a compact set $K \subset \mathbb{R}^d$ be given. Then the map $\alpha \mapsto r(y; \alpha)$ is
uniformly Lipschitz continuous over $\alpha \in K$ for each $y \in S$.
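For a finite state chain with deterministic summands $b_i(y) = f(y)$ and scalar $\alpha$, the eigenvalue/eigenfunction problem reduces to a Perron-Frobenius problem for the matrix with entries $p(x,y)e^{\alpha f(y)}$. The following is a minimal sketch using power iteration; the function name `eigenpair`, the list-of-lists representation of $p$, and the fixed iteration count are assumptions of this example, and convergence is only expected under irreducibility-type conditions such as those in Condition 3.1.

```python
import math


def eigenpair(P, f, alpha, iters=500):
    """Power iteration for sum_y P[x][y] * exp(alpha * f[y]) * r[y] = exp(G) * r[x]."""
    n = len(P)
    r = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        new = [sum(P[x][y] * math.exp(alpha * f[y]) * r[y] for y in range(n))
               for x in range(n)]
        lam = max(new)               # normalize so that max_x r(x) = 1
        r = [v / lam for v in new]
    return math.log(lam), r
```

When the rows of `P` are identical the chain is iid and the returned eigenvalue should reduce to the log moment generating function of $f(Y)$, with a constant eigenfunction.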


5  The  Isaacs  Equation  and  Subsolutions 

As in Section 2, the partial differential equation associated with this
importance sampling problem is the Isaacs equation

$$W_t(x,t) + \sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(DW(x,t); \alpha, \beta) = 0,$$

where $\mathbb{H}$ is as defined in (2.2). We recall that $W : \mathbb{R}^d \times [0,1] \to \mathbb{R}$, $W_t$ denotes
the partial derivative with respect to $t$, and $DW$ the gradient in $x$, and that
$(\bar W, \bar\alpha)$ is a subsolution/control pair if $\bar W$ is continuously differentiable on
$\mathbb{R}^d \times (0,1)$ with a uniformly bounded and uniformly Lipschitz continuous
derivative,

$$\bar W_t(x,t) + \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(D\bar W(x,t); \bar\alpha(x,t), \beta) \ge 0,$$

and

$$\bar W(x,1) \le 2F(x). \tag{5.1}$$

We will also use the term subsolution to refer to the $\bar W$ component alone,
and will sometimes use the phrase even if it is not certain that the terminal
condition holds. With each subsolution/control pair, one can associate an
importance sampling scheme. The construction and analysis of this scheme
are carried out in detail in the next two sections.

A companion paper [4] describes in detail how to construct subsolution/control
pairs that satisfy the terminal condition (5.1). As we saw
in Section 2, it is often the case that one can work with simple subsolutions
(e.g., functions that are affine in $x$ and $t$) for certain functionals $F$,
and then use the pointwise minimum of such subsolutions to handle more
complex $F$. Suppose we label the individual smooth subsolution/control
pairs $(\bar W_k, \bar\alpha_k)$, $k = 1, \ldots, K$. Let $\bar W(x,t) = \wedge_{k=1}^{K} \bar W_k(x,t)$. Since $\bar W$ is not
smooth, we mollify and use convexity to obtain a smooth subsolution
denoted by $\bar W^\delta$. If $\bar\alpha(x,t)$ is defined as a saddle point in the min/max problem

$$\sup_{\alpha \in \mathbb{R}^d} \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(D\bar W^\delta(x,t); \alpha, \beta),$$

then $(\bar W^\delta, \bar\alpha)$ is a subsolution/control pair.

For the general model of this paper the individual subsolutions $\bar W_k$ are
often affine functions in $(x,t)$, and each $\bar\alpha_k$ is a constant. In this case, the
change of measure associated with each subsolution/control pair $(\bar W_k, \bar\alpha_k)$ is
fairly simple. For example, in the case of functionals of a Markov process
we need only compute the corresponding eigenfunction at the single point
$\bar\alpha_k$. However, the subsolution/control pair $(\bar W^\delta, \bar\alpha)$, where $\bar\alpha$ is defined by the
saddle point property, produces a state and time dependent $\bar\alpha$, and thus a
scheme that could be significantly more complicated.

As discussed in Section 2, a scheme which preserves the simplicity of the
affine subsolutions can be found by appropriately randomizing between the
individual importance sampling schemes associated with $(\bar W_k, \bar\alpha_k)$ according
to the mollification weights. Schemes of this sort require the following
more complicated notion of a subsolution/control pair, which subsumes the
previous special case. As suggested by the examples given in Section 2, it is
typically the case that $r_k = (\bar W_k)_t$ and $s_k = D\bar W_k$ in the definition below.

Definition 5.1 The collection $(\bar W, \rho_k, \bar\alpha_k)$ will be called a generalized
subsolution/control if the following conditions hold. $\rho_k : \mathbb{R}^d \times [0,1] \to \mathbb{R}$, $k =
1, \ldots, K$, is a partition of unity, i.e., each $\rho_k$ is non-negative, and

$$\sum_{k=1}^{K} \rho_k(x,t) = 1$$

for all $(x,t) \in \mathbb{R}^d \times [0,1]$. The functions $\rho_k$ and $\bar\alpha_k$, $k = 1, \ldots, K$, are
uniformly bounded and Lipschitz continuous. $\bar W_t$ and $D\bar W$ have representations

$$\bar W_t(x,t) = \sum_{k=1}^{K} \rho_k(x,t) r_k(x,t), \qquad D\bar W(x,t) = \sum_{k=1}^{K} \rho_k(x,t) s_k(x,t),$$

where each $r_k$ and $s_k$ is uniformly bounded and Lipschitz continuous, and
for each $k = 1, \ldots, K$

$$r_k(x,t) + \inf_{\beta \in \mathbb{R}^d} \mathbb{H}(s_k(x,t); \bar\alpha_k(x,t), \beta) \ge 0.$$
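The inequality in Definition 5.1 can be checked pointwise once $H$ is known, since evaluating the infimum over $\beta$ as in Section 2 gives $\inf_\beta \mathbb{H}(s; \alpha, \beta) = -H(-s - \alpha) - H(\alpha)$. The sketch below assumes the scalar case $H(\alpha) = \alpha^2/2$ purely for illustration; the function name `slack` is hypothetical.

```python
def H(alpha):
    """Assumed example: H(alpha) = alpha^2 / 2 (scalar Gaussian case)."""
    return 0.5 * alpha ** 2


def slack(r, s, a):
    """Return r + inf_beta HH(s; a, beta) = r - H(-s - a) - H(a)."""
    return r - H(-s - a) - H(a)
```

For an affine piece with $r_k = 2H(\alpha_k^*)$, $s_k = -2\alpha_k^*$ and $\bar\alpha_k = \alpha_k^*$ the slack is zero, so the inequality holds with equality; a non-negative slack indicates that the condition is satisfied at that point.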

It is only the $(\rho_k, \bar\alpha_k)$ part of this collection that will be used to define
the importance sampling scheme (see the next section). As noted in Section
2, a key measure of efficiency of any importance sampling scheme associated
with the collection $(\bar W, \rho_k, \bar\alpha_k)$ is $\bar W(0,0)$, with larger values of $\bar W(0,0)$
corresponding to greater variance reduction. The design problem is to
maximize $\bar W(0,0)$, subject to the constraints that $(\bar W, \rho_k, \bar\alpha_k)$ be a generalized
subsolution/control, and that the terminal condition (5.1) hold. There is
considerable flexibility, and an appreciation of all the possibilities for even
simple problems requires some experience. These issues are explored at
length in [4].

6  Importance  Sampling  Based  on  Subsolutions 

As  discussed  in  Section  2  and  many  other  places,  the  idea  behind  importance 
sampling is to simulate the system of interest under an alternative distribution,
multiply the sample by the inverse of the Radon-Nikodym derivative,
and  then  consider  the  sample  average.  The  main  issue  is  how  to  choose  the 
new  distribution  so  that  the  variance  of  the  estimate  is  as  small  as  possible. 
It  is  also  by  now  well  known  that  seemingly  reasonable  schemes  can  perform 
very  poorly  [5,  6],  whence  the  development  of  usable  tools  for  the  analysis 
of  variance  is  essential  if  the  method  is  ever  to  be  used  with  any  confidence. 
As  we  now  show,  the  subsolution  property  gives  strong  quantitative  control 
on  the  second  moments  of  the  estimates  under  the  scheme. 

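Let $(\bar W, \rho_k, \bar\alpha_k)$ be a generalized subsolution/control. We recall the
eigenvalue/eigenfunction relation

\[ \int_S \int_{\mathbb{R}^d} e^{\langle \alpha, z\rangle}\, r(y;\alpha)\, m(dz|y)\, p(x,dy) = e^{H(\alpha)}\, r(x;\alpha). \]

It follows that for each $\alpha \in \mathbb{R}^d$

\[ P(y_1, dy_2, dz; \alpha) = e^{\langle \alpha, z\rangle - H(\alpha)}\, \frac{r(y_2;\alpha)}{r(y_1;\alpha)}\, p(y_1, dy_2)\, m(dz|y_2) \tag{6.1} \]

defines a probability measure on $S \times \mathbb{R}^d$. These probability measures, the
weights $\rho_k(x,t)$, and the functions $\bar\alpha_k(x,t)$ will be used to construct the
importance sampling scheme. To this end, let

\[ \bar\alpha^n_{k,j}(x) = \bar\alpha_k(x, j/n), \qquad \rho^n_{k,j}(x) = \rho_k(x, j/n). \]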

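Processes $\bar X^n_j$, $\bar Y^n_j$ and $\bar b^n_j$, analogous to $X_j$, $Y_j$, and $b_j(Y_j)$, are constructed
recursively as follows. Let $\bar X^n_0 = 0$ and $\bar Y^n_0 = Y_0 = y$. Suppose that $\bar X^n_j = x$
and $\bar Y^n_j = y_1$ are given. We then simulate $(\bar Y^n_{j+1}, \bar b^n_{j+1})$ under the distribution

\[ \sum_{k=1}^{K} \rho^n_{k,j}(x)\, P\big(y_1, dy_2, dz; \bar\alpha^n_{k,j}(x)\big), \]

which can be thought of as a randomized version of $P(y_1, dy_2, dz; \bar\alpha^n_{k,j}(x))$,
with the weights given by $\rho^n_{k,j}(x)$. Finally, $\bar X^n_{j+1} = \bar X^n_j + \bar b^n_{j+1}/n$. An unbiased
estimate for $E\exp\{-nF(X_n)\}$ is then obtained by averaging replications of

\[ Z^n = e^{-nF(\bar X^n_n)} \prod_{j=0}^{n-1} \left[ \sum_{k=1}^{K} \rho^n_{k,j}(\bar X^n_j)\, e^{\langle \bar\alpha^n_{k,j}(\bar X^n_j),\, \bar b^n_{j+1}\rangle - H(\bar\alpha^n_{k,j}(\bar X^n_j))}\, \frac{r(\bar Y^n_{j+1}; \bar\alpha^n_{k,j}(\bar X^n_j))}{r(\bar Y^n_j; \bar\alpha^n_{k,j}(\bar X^n_j))} \right]^{-1}. \tag{6.2} \]
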


As noted previously, the numerical estimate in importance sampling is
the sample average of independent replications of $Z^n$. Since the goal is to
control the sample variance, it is enough to bound the second moment of a
single replication.
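To make the recursive construction and the estimator (6.2) concrete, the following is a minimal sketch under strong simplifying assumptions that are ours and not the paper's general setting: the state space $S$ is a single point (so $r \equiv 1$ and the increments are i.i.d.), $d = 1$, the increments are standard Gaussian (so $H(\alpha) = \alpha^2/2$ and tilting by $\alpha$ gives $N(\alpha,1)$), and $e^{-nF}$ is taken formally to be the indicator of $\{|x| \ge c\}$. The two controls $\bar\alpha_1 = c$, $\bar\alpha_2 = -c$ and the weight formula in `weights` are hypothetical choices in the spirit of the mollification weights of Section 2, and the function names are illustrative only.

```python
import math
import random


def weights(x, c, delta):
    """Hypothetical weights (rho_1, rho_2) proportional to exp(-W_k(x,t)/delta)
    for two affine pieces with gradients -2c and +2c. Their common
    time-dependent constant cancels, so the weights depend on x only."""
    u = max(-50.0, min(50.0, 4.0 * c * x / delta))  # clip to avoid overflow
    rho1 = 1.0 / (1.0 + math.exp(-u))
    return (rho1, 1.0 - rho1)


def one_replication(n, c, delta, rng):
    """One replication of Z^n: simulate under the randomized tilted
    distribution and divide by the accumulated likelihood ratio."""
    alphas = (c, -c)
    x = 0.0
    log_lr = 0.0  # log of the product over j of the bracket in (6.2)
    for _ in range(n):
        rho = weights(x, c, delta)
        k = 0 if rng.random() < rho[0] else 1   # pick a control with prob. rho_k
        b = rng.gauss(alphas[k], 1.0)           # tilted N(0,1) is N(alpha_k, 1)
        mix = sum(r * math.exp(a * b - 0.5 * a * a) for r, a in zip(rho, alphas))
        log_lr += math.log(mix)
        x += b / n
    indicator = 1.0 if abs(x) >= c else 0.0
    return indicator * math.exp(-log_lr)


def estimate(n, c, delta, num_reps, rng):
    """Sample average of independent replications."""
    return sum(one_replication(n, c, delta, rng) for _ in range(num_reps)) / num_reps
```

In this Gaussian case the target is $P\{|X_n| \ge c\} = \operatorname{erfc}(c\sqrt{n}/\sqrt{2})$, so a call such as `estimate(20, 0.5, 0.1, 20000, random.Random(0))` can be compared with `math.erfc(0.5 * math.sqrt(20) / math.sqrt(2))`. The estimator is unbiased for any strictly positive choice of weights; the choice of weights only affects the variance.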


7  Statement  and  Proof  of  the  Main  Result 

In  this  section  we  present  the  main  result,  which  is  an  asymptotic  bound 
on the second moment for the importance sampling estimator associated with a
given  subsolution.  Although  both  the  quantity  being  approximated  and  the 
importance  sampling  scheme  depend  on  the  initial  state  Yq  =  y,  to  simplify 
the  exposition,  the  dependence  of  expected  values  on  y  is  not  explicitly 
denoted. 


Theorem 7.1 Assume Condition 3.1. Let $(\bar W, \rho_k, \bar\alpha_k)$ be a generalized
subsolution/control such that $2F(x) \ge \bar W(x,1)$ for every $x \in \mathbb{R}^d$. Let $V^n$ be the
second moment of a single replication used in the corresponding importance
sampling scheme for the estimation of $E\exp\{-nF(X_n)\}$, that is,

\[ V^n = E\big[(Z^n)^2\big], \]

where $Z^n$ is defined in (6.2). Let

\[ W^n = -\frac{1}{n}\log V^n. \]

Then

\[ \liminf_{n\to\infty} W^n \ge \bar W(0,0). \]
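Outline of the Proof of Theorem 7.1. By expressing the second moment
in terms of the original random variables, we can write

\[ V^n = E\left[ e^{-2nF(X_n)} \prod_{j=0}^{n-1} \left[ \sum_{k=1}^{K} \rho^n_{k,j}(X_j)\, e^{\langle \bar\alpha^n_{k,j}(X_j),\, b_{j+1}(Y_{j+1})\rangle - H(\bar\alpha^n_{k,j}(X_j))}\, \frac{r(Y_{j+1}; \bar\alpha^n_{k,j}(X_j))}{r(Y_j; \bar\alpha^n_{k,j}(X_j))} \right]^{-1} \right]. \]
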

The  proof  is  divided  into  5  parts. 


1. Representation. We replace $V^n$ by an upper bound, and then derive a
stochastic control representation for the normalized logarithm of this
quantity. This produces a lower bound for $W^n$.


2.  Tightness.  Associate  certain  stochastic  processes  and  measure  valued 
processes  to  the  representation.  Under  the  assumption  that  the  costs 
in  the  representation  are  bounded  as  n  — >  oo,  show  that  these  processes 
are  tight. 


3.  Identification  of  limits.  Derive  characterizations  and  relations  between 
the  limit  processes. 


4.  Analysis  of  the  cost.  Go  back  to  the  representation,  and  analyze  the 
asymptotics  of  the  cost  using  weak  convergence. 


5. Verification. Finally, use the Isaacs equation and a classical verification
argument to show that the proper asymptotic bound holds for the
representation.

The chain rule for relative entropy (see, e.g., [1, Theorem C.3.1]) will
be used several times in the proof. If $S_1$ and $S_2$ are Polish spaces and
$\mu, \nu \in \mathcal{P}(S_1 \times S_2)$, then

\[ R(\mu\|\nu) = R([\mu]_1\|[\nu]_1) + \int_{S_1} R\big(\mu(\cdot|y_1)\,\|\,\nu(\cdot|y_1)\big)\, [\mu]_1(dy_1). \tag{7.1} \]
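The chain rule (7.1) can be checked numerically on a finite product space. The sketch below is only an illustration: it represents probability measures on $S_1 \times S_2$ as Python dictionaries keyed by pairs `(y1, y2)`, and the helper names are ours. It assumes the measures are given on a common finite support.

```python
import math


def rel_entropy(mu, nu):
    """R(mu || nu) for finite distributions given as dicts; returns +inf
    when mu is not absolutely continuous with respect to nu."""
    total = 0.0
    for key, m in mu.items():
        if m == 0.0:
            continue
        if nu.get(key, 0.0) == 0.0:
            return math.inf
        total += m * math.log(m / nu[key])
    return total


def marginal1(joint):
    """First marginal [.]_1 of a distribution on pairs (y1, y2)."""
    out = {}
    for (y1, _), p in joint.items():
        out[y1] = out.get(y1, 0.0) + p
    return out


def conditional(joint, y1):
    """Conditional distribution of y2 given the first coordinate y1."""
    mass = marginal1(joint)[y1]
    return {y2: p / mass for (a, y2), p in joint.items() if a == y1}


def chain_rule_rhs(mu, nu):
    """Right side of (7.1): marginal term plus averaged conditional terms."""
    mu1, nu1 = marginal1(mu), marginal1(nu)
    rhs = rel_entropy(mu1, nu1)
    if math.isinf(rhs):
        return rhs
    for y1, w in mu1.items():
        if w > 0.0:
            rhs += w * rel_entropy(conditional(mu, y1), conditional(nu, y1))
    return rhs
```

For example, with `mu = {(0,0): 0.1, (0,1): 0.2, (1,0): 0.3, (1,1): 0.4}` and `nu` uniform on the same four points, `rel_entropy(mu, nu)` and `chain_rule_rhs(mu, nu)` agree up to rounding.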


7.1  Representation. 

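Using convexity of $e^x$ and the definition $G(x) = \bar W(x,1) \le 2F(x)$, the
second moment $V^n$ is bounded above by

\[ \mathcal{V}^n = E\left[ e^{-nG(X_n)} \exp\left\{ -\sum_{j=0}^{n-1}\sum_{k=1}^{K} \rho^n_{k,j}(X_j) \left[ \langle \bar\alpha^n_{k,j}(X_j), b_{j+1}(Y_{j+1})\rangle - H(\bar\alpha^n_{k,j}(X_j)) + \log\frac{r(Y_{j+1}; \bar\alpha^n_{k,j}(X_j))}{r(Y_j; \bar\alpha^n_{k,j}(X_j))} \right] \right\} \right]. \]

Define

\[ \mathcal{W}^n = -\frac{1}{n}\log \mathcal{V}^n. \]

Clearly $\mathcal{W}^n \le W^n$. Therefore, it suffices to show

\[ \liminf_{n\to\infty} \mathcal{W}^n \ge \bar W(0,0). \tag{7.2} \]

We would like to use the variational representation for exponential integrals
to derive a stochastic control representation for $\mathcal{W}^n$. Because of the
unbounded terms $\langle \bar\alpha^n_{k,j}(X_j), b_{j+1}(Y_{j+1})\rangle$ and $G(X_n)$, an extension of this
representation is required.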

Lemma 7.2 Let $\lambda$ be a probability measure on a measurable space $(\Omega, \mathcal{F})$
and $f : \Omega \to \mathbb{R}$ a measurable function. If $e^{-f}$ and $f e^{-f}$ are integrable with
respect to $\lambda$, then

\[ -\log\int_\Omega e^{-f}\, d\lambda = \inf_{\gamma}\left\{ R(\gamma\|\lambda) + \int_\Omega f\, d\gamma \right\}, \]

where the infimum is taken over all probability measures $\gamma$ for which the sum
on the right-hand side is meaningful (i.e., not of the form $\infty - \infty$).

The proof only involves minor changes to that of [1, Proposition 1.4.2] and
is thus omitted. It is easy to check that the condition for this representation,
that is, the finiteness of the two integrals, holds in our case. This is due to
the bound on the moment generating function of the $b_i(y)$ and the assumed
Lipschitz property of $G(x) = \bar W(x,1)$.
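On a finite space the representation in Lemma 7.2 can be verified directly, since the infimum is attained at the tilted measure $\gamma^*(\omega) \propto e^{-f(\omega)}\lambda(\omega)$. The following sketch, with function names of our own choosing, compares both sides and is intended only as a sanity check of the formula, not as part of the proof.

```python
import math


def neg_log_exp_integral(f, lam):
    """Left side: -log of the sum of exp(-f_i) * lam_i."""
    return -math.log(sum(math.exp(-fi) * li for fi, li in zip(f, lam)))


def objective(gamma, f, lam):
    """Right side before the infimum: R(gamma || lam) + integral of f d(gamma).
    Assumes lam_i > 0 wherever gamma_i > 0."""
    ent = sum(g * math.log(g / l) for g, l in zip(gamma, lam) if g > 0.0)
    return ent + sum(g * fi for g, fi in zip(gamma, f))


def minimizer(f, lam):
    """Tilted measure proportional to exp(-f) * lam, which attains the infimum."""
    w = [math.exp(-fi) * li for fi, li in zip(f, lam)]
    z = sum(w)
    return [wi / z for wi in w]
```

With, say, `f = [0.3, -1.2, 2.0]` and `lam = [0.2, 0.5, 0.3]`, the value `objective(minimizer(f, lam), f, lam)` coincides with `neg_log_exp_integral(f, lam)`, while any other probability vector gives a value at least as large.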

Once one has this general relative entropy representation for exponential
integrals, it is easy to extract a more useful form by a standard argument.
Consider the total distribution, say $\lambda$, of the component random variables
used to construct the process [here the $Y_i$ and $b_i(Y_i)$], and write the
expectation in terms of an exponential integral against this distribution. Apply
the relative entropy representation to this exponential integral, and let $\gamma$ be
the probability measure introduced by the representation. Now factor both
the original distribution $\lambda$ and the new probability measure $\gamma$ as a product
of conditional distributions. For example, if $\lambda$ were a distribution on a product
of three spaces, it would be factored as $[\lambda]_1(dx_1)[\lambda]_2(dx_2|x_1)[\lambda]_3(dx_3|x_1,x_2)$.
One then decomposes the relative entropy according to the chain rule (7.1), giving rise to
a relative entropy cost for the perturbation of the conditional distribution
of each component random variable. Finally, for convenience one writes the
right hand side of the relative entropy representation in terms of this
decomposition and random variables distributed according to the new probability
measure. Since the analogous elementary proof appears in many places (e.g.,
[1, Theorem B.2.2]), we simply state the final result. Consider a collection of
stochastic kernels $\{\mu^n_j\}$ and $\{\nu^n_j\}$, where $\mu^n_j$ is allowed to depend in any
measurable way on $\{\bar b^n_i, 0 \le i \le j\}$ and $\{\bar Y^n_i, 0 \le i \le j+1\}$, $\nu^n_j$ is allowed to depend
in any measurable way on $\{\bar b^n_i, 0 \le i \le j\}$ and $\{\bar Y^n_i, 0 \le i \le j\}$, and $\mu^n_j$ and $\nu^n_j$
choose the conditional distributions of $\bar b^n_{j+1}$ and $\bar Y^n_{j+1}$, respectively. To
simplify the notation the dependencies of $\mu^n_j$ and $\nu^n_j$ on the past will not be
made explicit. Let

\[
\begin{aligned}
J^n(\mu^n, \nu^n) = E\Bigg[ \frac{1}{n}\sum_{j=0}^{n-1}\sum_{k=1}^{K} \rho^n_{k,j}(\bar X^n_j) \bigg( & R\big(\nu^n_j(\cdot)\,\|\,p(\bar Y^n_j, \cdot)\big) + R\big(\mu^n_j(\cdot)\,\|\,m(\cdot|\bar Y^n_{j+1})\big) \\
& + \langle \bar\alpha^n_{k,j}(\bar X^n_j), \bar b^n_{j+1}\rangle - H\big(\bar\alpha^n_{k,j}(\bar X^n_j)\big) + \log\frac{r(\bar Y^n_{j+1}; \bar\alpha^n_{k,j}(\bar X^n_j))}{r(\bar Y^n_j; \bar\alpha^n_{k,j}(\bar X^n_j))} \bigg) + G(\bar X^n_n) \Bigg].
\end{aligned}
\tag{7.3}
\]

Then $\mathcal{W}^n = \inf J^n(\mu^n, \nu^n)$, where the infimum is over all such collections.


7.2  Tightness 

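To analyze the asymptotics of $\mathcal{W}^n$ we first establish the tightness of the
processes that appear therein. For $j = 0, \ldots, n-1$ and $t \in [j/n, (j+1)/n)$
define

\[
\begin{aligned}
\nu^n(dy_2|t) &= \nu^n_j(dy_2), \\
\mu^n(dz|t) &= \mu^n_j(dz), \\
\theta^n(dy_1 \times dy_2|t) &= \delta_{\bar Y^n_j}(dy_1)\, \nu^n_j(dy_2), \\
\gamma^n(dy_1 \times dy_2|t) &= \delta_{\bar Y^n_j}(dy_1)\, p(y_1, dy_2), \\
\zeta^n(dy \times dz|t) &= \delta_{\bar Y^n_{j+1}}(dy)\, \mu^n_j(dz), \\
\eta^n(dy \times dz|t) &= \delta_{\bar Y^n_{j+1}}(dy)\, m(dz|y),
\end{aligned}
\]

and let left continuity define these processes at $t = 1$. We also set, for Borel
subsets $A \subset S \times S$ and $B \subset [0,1]$,

\[ \theta^n(A \times B) = \int_B \theta^n(A|t)\, dt. \]

Then $\theta^n$ is a random probability measure on the space $(S \times S) \times [0,1]$. Define
in the analogous fashion random probability measures $\nu^n$, $\mu^n$, $\gamma^n$, $\zeta^n$, and $\eta^n$
on spaces $S \times [0,1]$, $\mathbb{R}^d \times [0,1]$, $(S \times S) \times [0,1]$, $(S \times \mathbb{R}^d) \times [0,1]$, and
$(S \times \mathbb{R}^d) \times [0,1]$, respectively.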

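Lemma 7.3 Assume Condition 3.1 and let $(\bar W, \rho_k, \bar\alpha_k)$ be a generalized
subsolution/control. Consider any subsequence and collection $\{(\mu^n_j, \nu^n_j), j = 0, 1, \ldots, n-1\}$
for which the expected cost $J^n(\mu^n, \nu^n)$ as defined in (7.3) is
uniformly bounded from above. Then (with the supremum on $n$ restricted to
elements of the subsequence)

\[ \lim_{C\to\infty} \sup_n E\left[ \frac{1}{n}\sum_{j=1}^{n} \|\bar b^n_j\|\, 1_{\{\|\bar b^n_j\| \ge C\}} \right] = 0, \]

the collection $\{(\bar X^n, \nu^n, \mu^n, \theta^n, \gamma^n, \zeta^n, \eta^n)\}$
is tight, $\{\bar X^n(1)\}$ is uniformly integrable, and $\{\mu^n\}$ is uniformly integrable
in the sense that

\[ \lim_{C\to\infty} \sup_n E\left[ \int_{\mathbb{R}^d \times [0,1]} \|y\|\, 1_{\{\|y\| \ge C\}}\, \mu^n(dy \times dt) \right] = 0. \]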


The proof of the lemma is given in the appendix. However, it is worth noting
that the first estimate is the key result, and that the tightness and uniform
integrability follow easily from this.

In order to show the desired lower bound (7.2), all we need to show is

\[ \liminf_{n\to\infty} J^n(\mu^n, \nu^n) \ge \bar W(0,0) \tag{7.4} \]

for any sequence $\{(\mu^n_j, \nu^n_j), j = 0, \ldots, n-1\}$. Abusing notation a bit, assume
from now on that $\{(\mu^n_j, \nu^n_j), j = 0, \ldots, n-1\}$ is an arbitrary subsequence
such that the cost $J^n(\mu^n, \nu^n)$ is uniformly bounded from above. Clearly, we
only need to show inequality (7.4) along every such subsequence.

Owing to the positivity, boundedness, and Lipschitz properties of the
eigenfunctions and $\bar\alpha_k$ (see Lemma 4.1), there exists $M < \infty$ such that for
all $y \in S$, $k = 1, \ldots, K$, $n \in \mathbb{Z}_+$, $j \in \{1, \ldots, n\}$, $x_1 \in \mathbb{R}^d$ and $x_2 \in \mathbb{R}^d$,

\[ \left| \log\frac{r(y; \bar\alpha^n_{k,j-1}(x_1))}{r(y; \bar\alpha^n_{k,j}(x_2))} \right| = \left| \log\frac{r(y; \bar\alpha_k(x_1, (j-1)/n))}{r(y; \bar\alpha_k(x_2, j/n))} \right| \le M\big(\|x_1 - x_2\| + 1/n\big). \]


Thanks to the first part of Lemma 7.3, for any $\delta > 0$ and along this
subsequence with bounded cost,

\[ \limsup_{n\to\infty} E\big[\,\cdots\,\big] = 0. \]


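Therefore, the Lipschitz properties of the $\rho_k$ that are part of the definition
of a generalized subsolution and the definition $\bar X^n_{j+1} = \bar X^n_j + \bar b^n_{j+1}/n$ imply

\[
\begin{aligned}
& \limsup_{n\to\infty} \left| E\left[ \frac{1}{n}\sum_{j=0}^{n-1}\sum_{k=1}^{K} \rho^n_{k,j}(\bar X^n_j) \log\frac{r(\bar Y^n_{j+1}; \bar\alpha^n_{k,j}(\bar X^n_j))}{r(\bar Y^n_j; \bar\alpha^n_{k,j}(\bar X^n_j))} \right] \right| \\
& \quad \le \limsup_{n\to\infty} E\left[ \frac{1}{n}\sum_{j=1}^{n}\sum_{k=1}^{K} \rho^n_{k,j}(\bar X^n_j) \left| \log\frac{r(\bar Y^n_j; \bar\alpha^n_{k,j-1}(\bar X^n_{j-1}))}{r(\bar Y^n_j; \bar\alpha^n_{k,j}(\bar X^n_j))} \right| \right] \\
& \qquad + \limsup_{n\to\infty} E\left[ \frac{1}{n}\sum_{j=0}^{n-1}\sum_{k=1}^{K} \big| \rho^n_{k,j+1}(\bar X^n_{j+1}) - \rho^n_{k,j}(\bar X^n_j) \big| \, \big| \log r(\bar Y^n_{j+1}; \bar\alpha^n_{k,j}(\bar X^n_j)) \big| \right] \\
& \quad = 0.
\end{aligned}
\]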


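Thus we need only prove the lower bound

\[
\begin{aligned}
\liminf_{n\to\infty} E\Bigg[ \frac{1}{n}\sum_{j=0}^{n-1}\sum_{k=1}^{K} \rho^n_{k,j}(\bar X^n_j) \bigg( & R\big(\mu^n_j(\cdot)\,\|\,m(\cdot|\bar Y^n_{j+1})\big) + R\big(\nu^n_j(\cdot)\,\|\,p(\bar Y^n_j, \cdot)\big) \\
& + \langle \bar\alpha^n_{k,j}(\bar X^n_j), \bar b^n_{j+1}\rangle - H\big(\bar\alpha^n_{k,j}(\bar X^n_j)\big) \bigg) + G(\bar X^n_n) \Bigg] \ge \bar W(0,0).
\end{aligned}
\]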


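Note that the relative entropy terms do not depend on $k$, and so they can
be moved past the corresponding sum. Thanks to the uniform boundedness
and Lipschitz continuity of $\rho_k$ and $\bar\alpha_k$, the uniform integrability of $\{\mu^n\}$
(Lemma 7.3), and the chain rule for relative entropy (7.1), all we need to
show is the lower bound

\[ \liminf_{n\to\infty} \bar J^n \ge \bar W(0,0), \tag{7.5} \]

where

\[
\begin{aligned}
\bar J^n = E\Bigg[ & R(\theta^n\|\gamma^n) + R(\zeta^n\|\eta^n) - \sum_{k=1}^{K}\int_0^1 \rho_k(\bar X^n(t), t)\, H\big(\bar\alpha_k(\bar X^n(t), t)\big)\, dt \\
& + \sum_{k=1}^{K}\int_{\mathbb{R}^d \times [0,1]} \rho_k(\bar X^n(t), t)\, \langle \bar\alpha_k(\bar X^n(t), t), z\rangle\, \mu^n(dz \times dt) + G(\bar X^n(1)) \Bigg].
\end{aligned}
\]

In order to show (7.5), we need to identify limits of the involved processes.
That is the goal of the next subsection.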


7.3  Identification  of  the  Limits. 

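Lemma 7.4 Assume Condition 3.1, and consider any subsequence along
which $J^n(\mu^n, \nu^n)$ is uniformly bounded from above and

\[ (\bar X^n, \nu^n, \mu^n, \theta^n, \gamma^n, \zeta^n, \eta^n) \to (\bar X, \nu, \mu, \theta, \gamma, \zeta, \eta) \]

in distribution. Then the following conclusions hold. Each of the measures
$\nu, \mu, \theta, \gamma, \zeta, \eta$ (for example, $\nu$) can be factored in the form $\nu(dy \times dt) = \nu(dy|t)\, dt$,
where $dt$ is Lebesgue measure. Furthermore, w.p.1

\[ \bar X(t) = \int_{[0,t]}\int_{\mathbb{R}^d} z\, \mu(dz|s)\, ds, \]

and

\[
\begin{aligned}
\gamma(dy_1 \times dy_2|t) &= \nu(dy_1|t)\, p(y_1, dy_2), \\
\eta(dy \times dz|t) &= \nu(dy|t)\, m(dz|y), \\
[\theta]_1(dy|t) &= [\theta]_2(dy|t) = \nu(dy|t), \\
[\zeta]_1(dy|t) &= \nu(dy|t), \qquad [\zeta]_2(dz|t) = \mu(dz|t).
\end{aligned}
\]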

Proof. The fact that the $t$-marginal of the random measures is Lebesgue
measure follows from the weak convergence and the fact that the same is
true of the analogous prelimit measures. Also, the existence of the factored
form is standard, and follows from the same sort of arguments one uses to
prove the existence of regular conditional distributions [1, Lemma 3.3.1].

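We next consider the representation for $\bar X$, and use an argument similar
to that of [1, Theorem 5.3.5]. For any time $t$ of the form $j/n$, $0 \le j \le n$, we
can write

\[
\begin{aligned}
\bar X^n(j/n) &= \frac{1}{n}\sum_{i=0}^{j-1}\int_{\mathbb{R}^d} z\, \mu^n_i(dz) + M^n(j/n) \\
&= \int_{\mathbb{R}^d \times [0, j/n]} z\, \mu^n(dz \times dt) + M^n(j/n),
\end{aligned}
\]

where

\[ M^n(j/n) = \frac{1}{n}\sum_{i=0}^{j-1}\left( \bar b^n_{i+1} - \int_{\mathbb{R}^d} z\, \mu^n_i(dz) \right) \]

is a martingale. Fix $\delta > 0$, and define random variables and random measures

\[ \bar b^{n,\delta}_{i+1} = \bar b^n_{i+1}\, 1_{\{\|\bar b^n_{i+1}\| \ge n\delta\}}, \qquad \lambda^n_i(dz) = \mu^n_i(dz)\, 1_{\{\|z\| \ge n\delta\}} + \delta_0(dz)\, \mu^n_i(\{\|z\| < n\delta\}), \]

where $\delta_0(dz)$ is the probability measure with mass 1 at zero. It is not difficult
to see that $\lambda^n_i$ gives the conditional distribution of $\bar b^{n,\delta}_{i+1}$, whence

\[ N^n(j/n) = \frac{1}{n}\sum_{i=0}^{j-1}\left( \bar b^{n,\delta}_{i+1} - \int_{\mathbb{R}^d} z\, \lambda^n_i(dz) \right) \]

is also a martingale. By a standard submartingale inequality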


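\[
\begin{aligned}
P\left\{ \max_{j=1,\ldots,n} \|M^n(j/n) - N^n(j/n)\| \ge \varepsilon \right\}
& \le \frac{1}{\varepsilon^2} E\left[ \frac{1}{n^2}\sum_{j=0}^{n-1} \left\| \bar b^n_{j+1} 1_{\{\|\bar b^n_{j+1}\| < n\delta\}} - \int_{\mathbb{R}^d} z\, \mu^n_j(dz)\, 1_{\{\|z\| < n\delta\}} \right\|^2 \right] \\
& \le \frac{1}{n^2\varepsilon^2} E\left[ \sum_{i=1}^{n} \|\bar b^n_i\|^2\, 1_{\{\|\bar b^n_i\| < n\delta\}} \right] \\
& \le \frac{\delta}{\varepsilon^2} E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| \right].
\end{aligned}
\]
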

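By Chebyshev's inequality and a conditioning argument,

\[
\begin{aligned}
P\left\{ \max_{j=1,\ldots,n} \|N^n(j/n)\| \ge \varepsilon \right\}
& \le P\left\{ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge n\delta\}} \ge \varepsilon/2 \right\}
 + P\left\{ \frac{1}{n}\sum_{i=0}^{n-1}\int_{\mathbb{R}^d} \|z\|\, 1_{\{\|z\| \ge n\delta\}}\, \mu^n_i(dz) \ge \varepsilon/2 \right\} \\
& \le \frac{2}{\varepsilon} E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge n\delta\}} \right]
 + \frac{2}{\varepsilon} E\left[ \frac{1}{n}\sum_{i=0}^{n-1}\int_{\mathbb{R}^d} \|z\|\, 1_{\{\|z\| \ge n\delta\}}\, \mu^n_i(dz) \right] \\
& = \frac{4}{\varepsilon} E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge n\delta\}} \right].
\end{aligned}
\]

The last quantity tends to zero as $n$ tends to infinity for each fixed $\delta > 0$ by
Lemma 7.3. Sending first $n \to \infty$ and then $\delta \to 0$, it follows that for each
$\varepsilon > 0$

\[ P\left\{ \max_{j=1,\ldots,n} \|M^n(j/n)\| \ge 2\varepsilon \right\} \to 0 \]

as $n \to \infty$. Thus

\[ \bar X^n(j/n) - \int_{\mathbb{R}^d \times [0, j/n]} z\, \mu^n(dz \times dt) \to 0 \]

uniformly in $j \in \{1, \ldots, n\}$, in probability. Using the uniform integrability
and weak convergence of $\mu^n$ we justify the limit

\[ \bar X(t) - \int_{\mathbb{R}^d \times [0,t]} z\, \mu(dz \times ds) = 0 \]

for all $t \in [0,1]$, w.p.1. When combined with the factorization $\mu(dz \times ds) = \mu(dz|s)\, ds$,
this proves the representation for $\bar X$.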

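Finally, we discuss the formulas for the limit measures. These all follow
easily from analogous properties of the prelimit measures. For example,
consider the random probability measure $\theta^n$. Let $g$ be an arbitrary bounded
continuous function on $S$. By definition,

\[
\begin{aligned}
\int_{S \times [0,1]} g(y)\, [\theta^n]_{1,3}(dy \times dt) - \int_{S \times [0,1]} g(y)\, [\theta^n]_{2,3}(dy \times dt)
&= \frac{1}{n}\sum_{j=0}^{n-1}\left( g(\bar Y^n_j) - \int_S g(y)\, \nu^n_j(dy) \right) \\
&= \frac{1}{n}\sum_{j=0}^{n-1}\left( g(\bar Y^n_{j+1}) - \int_S g(y)\, \nu^n_j(dy) \right) + I_n,
\end{aligned}
\]

where $I_n$ is an error term with

\[ |I_n| \le \frac{2}{n}\|g\|_\infty \]

almost surely. Fix arbitrary $\varepsilon > 0$. Let $N_0 \in \mathbb{N}$ be such that $|I_n| \le \varepsilon/2$ for
all $n \ge N_0$. Since $\nu^n_j$ is the conditional distribution of $\bar Y^n_{j+1}$, by Chebyshev's
inequality and a conditioning argument, for $n \ge N_0$

\[
\begin{aligned}
& P\left\{ \left| \int_{S \times [0,1]} g(y)\, [\theta^n]_{1,3}(dy \times dt) - \int_{S \times [0,1]} g(y)\, [\theta^n]_{2,3}(dy \times dt) \right| > \varepsilon \right\} \\
& \quad \le P\left\{ \left| \frac{1}{n}\sum_{j=0}^{n-1}\left( g(\bar Y^n_{j+1}) - \int_S g(y)\, \nu^n_j(dy) \right) \right| > \varepsilon/2 \right\} \\
& \quad \le \frac{4}{\varepsilon^2} E\left[ \frac{1}{n^2}\sum_{j=0}^{n-1}\left( g(\bar Y^n_{j+1}) - \int_S g(y)\, \nu^n_j(dy) \right)^2 \right] \\
& \quad \le \frac{16\|g\|_\infty^2}{\varepsilon^2 n}.
\end{aligned}
\]

By Fatou's Lemma

\[ P\left\{ \left| \int_{S \times [0,1]} g(y)\, [\theta]_{1,3}(dy \times dt) - \int_{S \times [0,1]} g(y)\, [\theta]_{2,3}(dy \times dt) \right| > \varepsilon \right\} = 0. \]

Thus $[\theta]_{1,3} = [\theta]_{2,3}$ almost surely. Since $[\theta^n]_{2,3} = \nu^n$,

\[ [\theta]_{1,3}(dy \times dt) = [\theta]_{2,3}(dy \times dt) = \nu(dy \times dt) = \nu(dy|t)\, dt, \]

which proves $[\theta]_1(dy|t) = [\theta]_2(dy|t) = \nu(dy|t)$.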

With regard to the decomposition of $\gamma$, an analogous argument shows
that, for any $\varepsilon > 0$ and bounded continuous functions $g_1, g_2$ on $S$, we have

\[ 0 = \lim_{n\to\infty} P\left\{ \left| \int_{S^2 \times [0,1]} g_1(y_1) g_2(y_2)\, \gamma^n(dy_1 \times dy_2 \times dt) - \int_{S^2 \times [0,1]} g_1(y_1) g_2(y_2)\, \nu^n(dy_1 \times dt)\, p(y_1, dy_2) \right| > \varepsilon \right\}. \]

However, by the Feller property the mapping $y_1 \mapsto \int_S g_2(y_2)\, p(y_1, dy_2)$ is
bounded and continuous. The decomposition of $\gamma$ now follows from the weak
convergence of $\gamma^n$ and $\nu^n$, Fatou's Lemma, the arbitrariness of $\varepsilon$, and the
fact that product functions are convergence determining (see, for example,
[1, Theorem A.3.14]).

The expressions for $\zeta$ and $\eta$ can be proved in the same way, and we omit
the proof. ■


7.4  Analysis  of  the  cost. 

We claim that $\liminf_{n\to\infty} \bar J^n$ [see equation (7.5)] is bounded below by

\[
\begin{aligned}
E\Bigg[ & R(\theta\|\gamma) + R(\zeta\|\eta) - \sum_{k=1}^{K}\int_0^1 \rho_k(\bar X(t), t)\, H\big(\bar\alpha_k(\bar X(t), t)\big)\, dt \\
& + \sum_{k=1}^{K}\int_{\mathbb{R}^d \times [0,1]} \rho_k(\bar X(t), t)\, \langle \bar\alpha_k(\bar X(t), t), z\rangle\, \mu(dz \times dt) + G(\bar X(1)) \Bigg].
\end{aligned}
\]

The bound for the first two relative entropy terms follows from the weak
convergence, Fatou's Lemma, and the lower semicontinuity of relative entropy
[1, Lemma 1.4.3]. The convergence of the next two terms follows from
the weak convergence, the continuity and boundedness properties of the $\rho_k$
and $\bar\alpha_k$, and the Dominated Convergence Theorem. Lastly, we show that

\[ \liminf_{n\to\infty} E\big[ G(\bar X^n(1)) \big] \ge E\big[ G(\bar X(1)) \big]. \tag{7.6} \]

Indeed, by the Lipschitz property of $\bar W$, there exists $C > 0$ such that

\[ G(x) \ge -C(\|x\| + 1). \]

By Fatou's Lemma,

\[ \liminf_{n\to\infty} E\big[ G(\bar X^n(1)) + C\|\bar X^n(1)\| \big] \ge E\big[ G(\bar X(1)) + C\|\bar X(1)\| \big]. \]

Since the uniform integrability of $\{\bar X^n(1)\}$ proved in Lemma 7.3 implies
$\lim_{n\to\infty} E\|\bar X^n(1)\| = E\|\bar X(1)\|$, the inequality (7.6) follows.

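Using the factorization properties of relative entropy (7.1), we now do
some rewriting of the various terms. We have

\[
\begin{aligned}
R(\theta\|\gamma) &= \int_0^1 R\big(\theta(dy_1 \times dy_2|t)\,\|\,\gamma(dy_1 \times dy_2|t)\big)\, dt, \\
R(\zeta\|\eta) &= \int_0^1 R\big(\zeta(dy \times dz|t)\,\|\,\eta(dy \times dz|t)\big)\, dt.
\end{aligned}
\]

However, by Lemma 7.4 $[\theta]_1(dy|t) = [\theta]_2(dy|t) = \nu(dy|t)$, $\gamma(dy_1 \times dy_2|t) = \nu(dy_1|t)\, p(y_1, dy_2)$,
$\eta(dy \times dz|t) = \nu(dy|t)\, m(dz|y)$, and $\zeta(dy \times dz|t) = \nu(dy|t)\, q(dz|y,t)$
for some stochastic kernel $q$. Since

\[
\begin{aligned}
\int_S\int_{\mathbb{R}^d} z\, q(dz|y,t)\, \nu(dy|t) &= \int_{S \times \mathbb{R}^d} z\, \zeta(dy \times dz|t) \\
&= \int_{\mathbb{R}^d} z\, [\zeta]_2(dz|t) \\
&= \int_{\mathbb{R}^d} z\, \mu(dz|t) \\
&= \beta(t),
\end{aligned}
\]

it follows from the definition of $L$ in (3.2) that

\[ R(\theta\|\gamma) + R(\zeta\|\eta) \ge \int_0^1 L(\beta(t))\, dt. \]

Moreover, the definition of $\beta(t)$ gives

\[ \int_{\mathbb{R}^d \times [0,1]} \langle \bar\alpha_k(\bar X(t), t), z\rangle\, \mu(dz \times dt) = \int_0^1 \langle \bar\alpha_k(\bar X(t), t), \beta(t)\rangle\, dt. \]

We thus obtain a lower bound for $\liminf_{n\to\infty} \bar J^n$ in the form

\[ \Gamma = E\left[ \int_0^1 \sum_{k=1}^{K} \rho_k(\bar X(t), t) \Big[ L(\beta(t)) - H\big(\bar\alpha_k(\bar X(t), t)\big) + \langle \bar\alpha_k(\bar X(t), t), \beta(t)\rangle \Big]\, dt + G(\bar X(1)) \right]. \]
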


7.5  Verification. 

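We now do a classical verification argument to show $\Gamma \ge \bar W(0,0)$. By
assumption (see Definition 5.1),

\[
\begin{aligned}
& \bar W_t(\bar X(t), t) + \langle D\bar W(\bar X(t), t), \beta(t)\rangle \\
& \quad = \sum_{k=1}^{K} \rho_k(\bar X(t), t) \Big[ r_k(\bar X(t), t) + \langle s_k(\bar X(t), t), \beta(t)\rangle \Big] \\
& \quad \ge -\sum_{k=1}^{K} \rho_k(\bar X(t), t) \Big[ L(\beta(t)) + \langle \bar\alpha_k(\bar X(t), t), \beta(t)\rangle - H\big(\bar\alpha_k(\bar X(t), t)\big) \Big].
\end{aligned}
\]

Integrating both sides from 0 to 1, and using the fact that $\beta(t) = d\bar X(t)/dt$,

\[ E\left[ \int_0^1 \sum_{k=1}^{K} \rho_k(\bar X(t), t) \Big[ L(\beta(t)) + \langle \bar\alpha_k(\bar X(t), t), \beta(t)\rangle - H\big(\bar\alpha_k(\bar X(t), t)\big) \Big]\, dt \right] \ge \bar W(0,0) - E\,\bar W(\bar X(1), 1). \]

Since $G(x) = \bar W(x, 1)$, upon bringing this term to the left hand side we
obtain $\Gamma \ge \bar W(0,0)$, thus completing the proof of Theorem 7.1. ■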


8  Appendix 


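Proof of Lemma 7.3. The proof uses ideas from [1, Proposition 5.3.2].
We start by observing a few facts, namely, that

\[ -G(\bar X^n_n) \le c\left( \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| + 1 \right) \]

for some constant $c < \infty$, that the eigenfunctions $r(y;\alpha)$ are bounded uniformly
from above and below away from zero on $\{\alpha : \|\alpha\| \le C\}$, that $H(\alpha)$ is bounded
from below on this set, and that relative entropy is non-negative. These imply the
existence of $C_1 < \infty$ and $C_2 < \infty$ such that

\[ \sup_n E\left[ \frac{1}{n}\sum_{i=0}^{n-1} R\big(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})\big) - C_1 \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| \right] \le C_2, \tag{8.1} \]

where the supremum is over the same subsequence as in the statement of the
lemma. It follows immediately that $R(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})) < \infty$ for all $i = 0, \ldots, n-1$,
with probability one. We can find non-negative, measurable, random
functions $f^n_i$ such that $f^n_i$ is a measurable version of $d\mu^n_i(\cdot)/dm(\cdot|\bar Y^n_{i+1})$.
We use the fact that for all $a \ge 0$, $c \ge 0$, and $p \ge 1$,

\[ ac \le e^{pa} + \frac{1}{p}\,(c\log c - c + 1). \]

Since $c\log c - c + 1 \ge 0$, it follows that

\[
\begin{aligned}
E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| \right]
&= E\left[ \frac{1}{n}\sum_{i=0}^{n-1}\int_{\mathbb{R}^d} \|z\|\, f^n_i(z)\, m(dz|\bar Y^n_{i+1}) \right] \\
&\le E\left[ \frac{1}{n}\sum_{i=0}^{n-1}\frac{1}{p}\int_{\mathbb{R}^d} \big( f^n_i(z)\log f^n_i(z) - f^n_i(z) + 1 \big)\, m(dz|\bar Y^n_{i+1}) \right]
 + E\left[ \frac{1}{n}\sum_{i=0}^{n-1}\int_{\mathbb{R}^d} e^{p\|z\|}\, m(dz|\bar Y^n_{i+1}) \right].
\end{aligned}
\]

Under Condition 3.1, for each $p$ there is a finite and uniform bound $B(p)$ on
$\int_{\mathbb{R}^d} e^{p\|z\|}\, m(dz|y)$ for all $y \in S$. This allows us to continue the inequality as

\[ E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| \right] \le \frac{1}{p} E\left[ \frac{1}{n}\sum_{i=0}^{n-1} R\big(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})\big) \right] + B(p). \]

Choosing $p = 2C_1$ (taking $C_1 \ge 1/2$ without loss of generality) and rearranging (8.1),

\[ \frac{1}{2}\sup_n E\left[ \frac{1}{n}\sum_{i=0}^{n-1} R\big(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})\big) \right] \le C_2 + C_1 B(2C_1). \]

By a very similar argument to that just used, we find

\[ E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}} \right] \le \sup_{y \in S}\int_{\mathbb{R}^d} 1_{\{\|z\| \ge C\}}\, e^{p\|z\|}\, m(dz|y) + \frac{1}{p} E\left[ \frac{1}{n}\sum_{i=0}^{n-1} R\big(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})\big) \right]. \]

Under Condition 3.1,

\[ \sup_{y \in S}\int_{\mathbb{R}^d} 1_{\{\|z\| \ge C\}}\, e^{p\|z\|}\, m(dz|y) \le e^{-C}\sup_{y \in S}\int_{\mathbb{R}^d} e^{(p+1)\|z\|}\, m(dz|y) \to 0 \]

as $C \to \infty$. Since we already have a uniform bound on

\[ E\left[ \frac{1}{n}\sum_{i=0}^{n-1} R\big(\mu^n_i(\cdot)\,\|\,m(\cdot|\bar Y^n_{i+1})\big) \right], \]

the first part of the lemma follows by first sending $C \to \infty$ and then $p \to \infty$.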
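We define a piecewise linear process $\tilde X^n$ by setting $\tilde X^n(0) = 0$ and

\[ \frac{d\tilde X^n(t)}{dt} = \bar b^n_{i+1} \quad \text{for } t \in (i/n, (i+1)/n), \quad i = 0, \ldots, n-1. \]

Then $\tilde X^n$ is the piecewise linear interpolation that agrees with $\bar X^n$ at times
of the form $i/n$, and hence if $\tilde X^n$ converges in distribution in the sup norm
to a limit $\bar X$ then so does $\bar X^n$, since

\[ \sup_{0 \le t \le 1} \|\tilde X^n(t) - \bar X^n(t)\| \to 0 \]

in probability as $n \to \infty$. Therefore, in order to show the tightness of $\{\bar X^n\}$,
it suffices to show that $\{\tilde X^n\}$ is tight. To this end, define the modulus

\[ w^n(\delta) = \sup_{\{s,t \in [0,1] : |t-s| \le \delta\}} \|\tilde X^n(t) - \tilde X^n(s)\|. \]

Tightness of $\{\tilde X^n\}$ will hold if for each $\varepsilon > 0$ and $\eta > 0$ there is $\delta \in (0,1)$
such that for all $n$

\[ P\{w^n(\delta) \ge \varepsilon\} \le \eta. \]

Choose $C < \infty$ such that for all $n$

\[ E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}} \right] \le \varepsilon\eta/2, \]

and let $\delta = (\varepsilon/2C) \wedge 1$. Then since $C\delta \le \varepsilon/2$

\[
\begin{aligned}
P\{w^n(\delta) \ge \varepsilon\}
& \le P\left\{ \sup_{\{s,t \in [0,1] : |t-s| \le \delta\}} \int_{s \wedge t}^{s \vee t} \left\| \frac{d\tilde X^n(r)}{dr} \right\| dr \ge \varepsilon \right\} \\
& \le P\left\{ \sup_{\{s,t \in [0,1] : |t-s| \le \delta\}} \int_{s \wedge t}^{s \vee t} \left\| \frac{d\tilde X^n(r)}{dr} \right\| 1_{\{\|d\tilde X^n(r)/dr\| \ge C\}}\, dr \ge \varepsilon/2 \right\} \\
& \le P\left\{ \int_0^1 \left\| \frac{d\tilde X^n(r)}{dr} \right\| 1_{\{\|d\tilde X^n(r)/dr\| \ge C\}}\, dr \ge \varepsilon/2 \right\} \\
& \le \frac{2}{\varepsilon} E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}} \right] \\
& \le \eta.
\end{aligned}
\]
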

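As for the uniform integrability of $\{\bar X^n(1)\}$, observe that for every $C > 0$,

\[ \|\bar X^n(1)\| \le \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\| \le C + \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}}. \]

This implies

\[
\begin{aligned}
\|\bar X^n(1)\|\, 1_{\{\|\bar X^n(1)\| \ge 2C\}}
& \le C\, 1_{\{\|\bar X^n(1)\| \ge 2C\}} + \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}} \\
& \le \frac{\|\bar X^n(1)\|}{2}\, 1_{\{\|\bar X^n(1)\| \ge 2C\}} + \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}},
\end{aligned}
\]

or

\[ \|\bar X^n(1)\|\, 1_{\{\|\bar X^n(1)\| \ge 2C\}} \le \frac{2}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}}, \]

which in turn implies the uniform integrability of $\{\bar X^n(1)\}$.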

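The tightness and uniform integrability properties of the random measure
$\{\mu^n(dy \times dt)\}$ are easy. Indeed,

\[ E\left[ \int_{\mathbb{R}^d \times [0,1]} \|y\|\, 1_{\{\|y\| \ge C\}}\, \mu^n(dy \times dt) \right] = E\left[ \frac{1}{n}\sum_{i=1}^{n} \|\bar b^n_i\|\, 1_{\{\|\bar b^n_i\| \ge C\}} \right]. \]

Uniform integrability holds since the last quantity tends to zero uniformly in
$n$ as $C \to \infty$, and the tightness is a consequence of the uniform integrability
[1, Theorem A.3.17]. ■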
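The proof of Lemma 7.3 rests on an elementary Young-type inequality, which we take here in the form $ac \le e^{pa} + \frac{1}{p}(c\log c - c + 1)$ for $a \ge 0$, $c \ge 0$, $p \ge 1$. This form is an assumption of the sketch, chosen to be consistent with the way the inequality is applied in the proof. The following minimal check evaluates the gap between the two sides on a grid; the function names are illustrative.

```python
import math


def ell(c):
    """ell(c) = c log c - c + 1, with the convention 0 log 0 = 0."""
    return 1.0 if c == 0.0 else c * math.log(c) - c + 1.0


def young_gap(a, c, p):
    """Right side minus left side of  a*c <= exp(p*a) + ell(c)/p.
    A non-negative value means the inequality holds at (a, c, p)."""
    return math.exp(p * a) + ell(c) / p - a * c


def min_gap_on_grid(p, a_max=5.0, c_max=50.0, steps=200):
    """Smallest gap over a uniform grid of (a, c) values for a given p."""
    best = math.inf
    for i in range(steps + 1):
        a = a_max * i / steps
        for j in range(steps + 1):
            c = c_max * j / steps
            best = min(best, young_gap(a, c, p))
    return best
```

The inequality follows from the convex duality between $e^x - 1$ and $c\log c - c + 1$, so `min_gap_on_grid(p)` should be non-negative for every $p \ge 1$. Larger $p$ shrinks the entropy term at the cost of a larger exponential moment, which is the trade-off exploited when sending $C \to \infty$ and then $p \to \infty$.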